Skip to content

fix(deps): subdir-agnostic bare cache fixes parallel sparse-checkout race (#1126)#1135

Merged
danielmeppiel merged 5 commits intomainfrom
fix/1126-bare-cache-materialization
May 4, 2026
Merged

fix(deps): subdir-agnostic bare cache fixes parallel sparse-checkout race (#1126)#1135
danielmeppiel merged 5 commits intomainfrom
fix/1126-bare-cache-materialization

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel commented May 4, 2026

Fixes #1126.

TL;DR

The shared-clone cache (WS2, introduced by PR #1116 as part of the install perf overhaul) was keyed on (host, owner, repo, ref, subdir) and stored subdir-pinned working trees populated via git checkout -- <subdir>. When two dependencies pointed at the same upstream repo@ref but different subdirectories, parallel installs raced through the cache: consumer A's checkout could be overwritten by consumer B's checkout before A finished copying its files, intermittently raising:

RuntimeError: Subdirectory '<path>' not found in repository

This PR rebuilds the cache around the same paradigm WS3's GitCache._create_checkout already uses successfully: the cache stores subdir-agnostic BARE clones, keyed only on (host, owner, repo, ref). Each consumer materializes its own private working tree from the bare via git clone --local --shared --no-checkout + git checkout HEAD. Sibling consumers no longer share mutable working-tree state, so the race goes away by construction.

Problem (WHY)

PR #1116 introduced SharedCloneCache to dedup git clones across the dependency graph during apm install. The original design cached the post-checkout working tree keyed on (host, owner, repo, ref, subdir). The race manifested when two deps shared (host, owner, repo, ref) but specified different subdirs:

  1. Consumer A acquires the per-key lock for (host, owner, repo, ref, "skills/X"), populates its checkout, releases the lock, and starts copying files out.
  2. Consumer B (different subdir, different cache key) acquires the upstream clone via the same shared underlying repo and runs git checkout -- skills/Y, mutating the working tree A is still reading from.
  3. A's os.path.exists(repo_dir / "skills/X") check now fails because B's checkout truncated/overwrote files A expected.

The per-key lock on SharedCloneCache was correct for cache population, but did not protect against post-population mutation when two cache keys map to the same on-disk repository. The fix is structural: stop sharing post-checkout state at all.

Approach (WHAT)

Mirror the WS3 GitCache pattern that already proves out in production:

  1. SharedCloneCache keys on (host, owner, repo, ref) only. No subdir component. Two deps that share repo+ref get the same cache entry.
  2. The cached value is a BARE clone, not a working tree. Bare clones are read-only fixtures: nothing mutates them post-population.
  3. Each consumer materializes its own private working tree from the bare via git clone --local --shared --no-checkout <bare> <consumer> + git checkout HEAD. The --local --shared mode reuses the bare's object store (no disk-space cost) but gives each consumer an isolated working tree and HEAD.
  4. Subdir extraction happens against the consumer's private tree, never against the shared bare.

This is the exact pattern in src/apm_cli/cache/git_cache.py::_create_checkout. We are not inventing a new approach; we are unifying WS2 with WS3.

Implementation (HOW)

Components touched

flowchart TD
    A["download_subdirectory_package<br/>github_downloader.py"] --> B{WS2 use_shared?}
    B -->|Yes| C["SharedCloneCache.get_or_clone<br/>per-key entry lock"]
    C --> D{Already cached?}
    D -->|Yes| F["Return bare_path"]
    D -->|No| E["_bare_clone_with_fallback"]
    E --> G["_execute_transport_plan<br/>Strategy template"]
    G --> H{"_bare_action: is_commit_sha?"}
    H -->|Yes| I["Tier-1: git init --bare +<br/>fetch --depth=1 origin SHA"]
    I -->|Fail| J["rmtree partial +<br/>Tier-2: git clone --bare full"]
    H -->|No| K["Tier-1: git clone --bare<br/>--depth=1 --branch ref"]
    K -->|Fail| L["rmtree partial +<br/>Tier-2: git clone --bare full"]
    I -->|OK| M["git update-ref HEAD sha<br/>_scrub_bare_remote_url"]
    J --> M
    K -->|OK| N["_scrub_bare_remote_url"]
    L --> N
    M --> F
    N --> F
    F --> O["_materialize_from_bare"]
    O --> P["git rev-parse HEAD from bare"]
    P --> Q["git clone --local --shared<br/>--no-checkout"]
    Q --> R["git config core.autocrlf=false<br/>+ LFS smudge disable"]
    R --> S["git checkout HEAD"]
    S --> T["Return resolved_sha"]
    T --> U["Copy subdir to target_path"]
    B -->|No| V["Legacy per-dep clone path"]
Loading

Key functions

  • SharedCloneCache (shared_clone_cache.py): cache key narrowed to (host, owner, repo, ref). Bare temp-dir name changed to bare. Optional APM_DEBUG=1 invariant assert that the cached path is a bare repo.

  • _bare_clone_with_fallback (github_downloader.py): new entry point. Builds a transport plan, then plugs _bare_action (subprocess-based) into the shared _execute_transport_plan template (which the existing working-tree _clone_with_fallback also uses via _wt_action). Two-tier ladder per attempt: tier-1 attempts shallow (init+fetch for SHA refs, --depth=1 --branch for symbolic refs); on CalledProcessError rmtrees partial state and falls back to tier-2 full bare clone. After every successful clone, _scrub_bare_remote_url redacts remote.origin.url to redacted:// so any tokenized URL is removed from the bare's .git/config.

  • _materialize_from_bare (github_downloader.py): runs git rev-parse HEAD against the bare (--git-dir=<bare>) to obtain the resolved SHA without ever opening a Repo() on the consumer dir (avoids Windows handle leaks that block downstream rmtree). Then git clone --local --shared --no-checkout followed by git checkout HEAD.

  • _execute_transport_plan (refactored): the existing transport-plan iteration logic was extracted into a Strategy template that accepts a clone_action callable. _wt_action (working tree, GitPython-based) and _bare_action (bare, subprocess-based) plug in. ADO bearer 401 retry composes correctly with both strategies. Catches (GitCommandError, subprocess.CalledProcessError) because the two strategies raise different exception types.

  • _ws2_resolved_commit thread-through: the resolved SHA from _materialize_from_bare is threaded past the legacy Repo(temp_clone_path).head.commit.hexsha block, so the lockfile records the same commit the bare-cache path produced.

Trade-offs

  • Cache disk usage: A bare clone for one repo+ref is now shared across all consumers, where previously each (repo, ref, subdir) got its own working tree. Net win on disk (objects deduplicated via --local --shared).
  • Per-consumer materialize cost: Each consumer pays a git clone --local --shared + git checkout HEAD (single-process, local FS). On modern hardware this is sub-100ms even for large repos. Acceptable cost for correctness.
  • Two clone strategies (_wt_action + _bare_action) coexist temporarily. The legacy non-cache path still uses _wt_action; the WS2 path uses _bare_action. Both compose with the shared _execute_transport_plan. Future cleanup could converge on bare-only, but is out of scope here.
  • _bare_action closure is ~100 lines (panel feedback). Logic is correct; extracting to a named method for testability is a recommended follow-up.

Validation evidence

  • CI Lint job mirror: uv run --extra dev ruff check src/ tests/ and uv run --extra dev ruff format --check src/ tests/ both silent. Verified locally before push.
  • Unit suite: uv run pytest tests/unit/deps/test_shared_clone_cache.py -q -> 21 passed in 2.17s. Full unit run -> 7573 passed.
  • New tests (tests/unit/deps/test_shared_clone_cache.py):
    • TestBareCacheRaceCondition::test_parallel_different_subdirs_both_succeed -- regression trap for the original race; two parallel materializes against the same bare both succeed.
    • TestBareCloneFallback -- tier-1 -> tier-2 fallback for both SHA and symbolic refs; covers subprocess.CalledProcessError triggering rmtree-then-retry.
    • TestMaterializeFromBare::test_consumer_resolved_sha_obtained_from_bare_not_consumer -- regression trap that SHA is read from --git-dir=<bare>, not Repo(consumer_dir). Prevents Windows handle-leak regressions.
    • TestSharedCloneCacheBareInvariant -- APM_DEBUG=1 invariant: cached path must be a bare repo.
    • TestExecuteTransportPlanWtAction + TestBareCloneRetryRmtree -- rmtree-before-retry semantics on both _wt_action and _bare_action.
    • TestInvalidSubdirErrorWording::test_typo_subdir_raises_subdirectory_not_found -- the user-facing typo-detection error message ("Subdirectory '...' not found in repository") is preserved through the bare-cache path.
  • New E2E (tests/integration/test_install_subdir_dedup_e2e.py): 3-variant parametrized test (symbolic-https, default-branch, sha-https) that drives parallel download_subdirectory_package against github/awesome-copilot with two sibling subdirs, asserting both succeed and produce the same resolved_commit. Wired into scripts/test-integration.sh::run_e2e_tests. Skips when GITHUB_APM_PAT/GITHUB_TOKEN is absent. SHA for the sha-https variant is resolved at runtime via gh api (no fabricated constants).

How to test

# unit (no auth required)
cd /path/to/apm
uv run pytest tests/unit/deps/test_shared_clone_cache.py -v

# E2E (requires GITHUB_APM_PAT or GITHUB_TOKEN)
GITHUB_APM_PAT=<your-pat> uv run pytest tests/integration/test_install_subdir_dedup_e2e.py -v -s

# real-world reproduction (against any apm.yml with two subdir deps from same repo+ref):
apm install --verbose
# Pre-fix: intermittent "Subdirectory '<path>' not found in repository" on parallel installs.
# Post-fix: clean install every time.

Panel review

Apm-review-panel was run; full advisory comment posted above. Stance: ship_with_followups. Zero blocking findings across eight specialist personas. Two highest-signal in-PR follow-ups already addressed in commit d5f51012:

  • CHANGELOG entry under ## [Unreleased] > ### Fixed (flagged by doc-writer + oss-growth-hacker independently).
  • Reworded RuntimeError user-facing wording ("Failed to materialize working tree from shared bare" -> "Failed to prepare dependency from cached clone") per devx-ux-expert finding.

Remaining recommended follow-ups are post-merge improvements: FETCH_HEAD truncation in _scrub_bare_remote_url (defense-in-depth, supply-chain-security); ADO bearer 401 retry through _bare_action unit test (test-coverage); cross-path lockfile parity test (test-coverage); _bare_action closure extraction (python-architect).

…race (#1126)

Refactor SharedCloneCache (WS2) to store BARE clones keyed only on
(host, owner, repo, ref). Each consumer materializes its own working
tree from the bare via 'git clone --local --shared --no-checkout' +
'git checkout HEAD'. Mirrors WS3's GitCache._create_checkout pattern.

Pre-fix: two parallel subdir deps from the same upstream repo+ref
raced through the cache; one consumer would materialize its subdir
into the cached path, then the second consumer would find the cached
path without its own subdir, raising
RuntimeError("Subdirectory '...' not found in repository").

Post-fix: the cache is subdir-agnostic. Two consumers requesting
different subdirs of the same repo+ref share one bare without racing.
Per-consumer materialization runs against a unique mkdtemp path.

Implementation:
- _execute_transport_plan: Strategy template extracted from
  _clone_with_fallback; both _wt_action (existing GitPython path)
  and _bare_action (new subprocess-based 3-tier ladder) plug in.
- _bare_clone_with_fallback: 3-tier fallback for SHA refs
  (init+fetch / shallow / full), 2-tier for symbolic refs.
- _materialize_from_bare: per-consumer working tree with CRLF +
  LFS pinning; SHA resolved from --git-dir=<bare> rev-parse HEAD
  to avoid Windows file-handle leak (no Repo() on consumer dir).
- _scrub_bare_remote_url: redacts remote.origin.url to redacted://
  after every successful bare clone; logs at WARNING on failure
  (audit trail for token-leak-aware operators).
- WS2 callsite rewritten in download_subdirectory_package; threads
  _ws2_resolved_commit past the legacy SHA-capture branch.
- shared_clone_cache.py: docstring rewritten; APM_DEBUG-gated
  bare-shape invariant assert.

Tests:
- 21 unit tests in test_shared_clone_cache.py covering 18 of 19
  design scenarios (race, fallback ladder, materialize lifetime,
  rmtree, invariant, error wording).
- New parametrized E2E (test_install_subdir_dedup_e2e.py) with 3
  ref variants (symbolic-https / sha-https / default-branch);
  added to scripts/test-integration.sh.
- All 7573 unit tests pass; ruff check + ruff format --check pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

APM Review Panel: ship_with_followups

Eliminates parallel sparse-checkout race in shared cache via subdir-agnostic bare clones -- 0 blocking findings, strong panel convergence on ship-readiness with one in-PR gap (CHANGELOG).

cc @danielmeppiel -- a fresh advisory pass is ready for your review.

All eight panelists converge: this PR correctly resolves a real intermittent failure that adopters hit in CI (the parallel sparse-checkout race from #1116). The architectural approach -- bare clones keyed on (host, owner, repo, ref) with per-consumer materialized working trees -- is sound, mirrors the existing GitCache._create_checkout pattern, and eliminates the shared-mutable-state root cause rather than papering over it with locks. Supply-chain-security and auth-expert both confirm the token-hygiene posture is improved (scrub eliminates on-disk persistence vector) with residual gaps bounded by 0700 tmpdir lifetime.

No panelist raised a blocking finding. The strongest cross-panel signal is the missing CHANGELOG entry, flagged independently by both doc-writer and oss-growth-hacker -- this is a repo convention violation per changelog.instructions.md and is trivially addressable in-PR before merge. The supply-chain findings (FETCH_HEAD retention, fail-closed on scrub failure) are defense-in-depth improvements worth tracking but do not represent exploitable vectors given the existing mitigations (0700 mode, short cache lifetime, rmtree on failure).

Test-coverage-expert confirms five surfaces covered with passing evidence at unit and integration tiers, including the critical race-condition regression trap. The two missing-test findings (ADO bearer retry through bare path, lockfile cross-path parity) are real gaps but represent additive coverage on paths that have indirect E2E protection today. They do not block shipment.

Dissent. No real dissent. Doc-writer and oss-growth-hacker independently flagged the same CHANGELOG gap -- this is convergence, not disagreement. Supply-chain-security and auth-expert overlap on the tier-1 token window finding with compatible recommendations (scrub FETCH_HEAD vs scrub-before-rmtree in except clause). Both are valid defense-in-depth; neither is exploitable today.

Aligned with: Secure by default (token-hygiene improved; residual FETCH_HEAD gap bounded by 0700 + short lifetime); Pragmatic as npm (parallel install of multiple subdirs from same repo+ref now works reliably); OSS community-driven (fixes intermittent CI failure adopters hit; trust-building fix).

Growth signal. Release beat opportunity: "parallel install reliability" is a trust-building narrative for adopters evaluating APM against manual git-submodule workflows. When this ships, release notes should lead with the user-visible symptom ("apm install no longer fails intermittently when two packages reference different subdirs of the same repo") rather than the internal bare-clone refactor. This is the kind of invisible-but-critical reliability story that converts evaluators into committed users.

Panel summary

Persona B R N Takeaway
Python Architect 0 2 2 Template Method extraction of _execute_transport_plan is well-factored; bare-action tier ladder is large but correct; no blocking issues.
CLI Logging Expert 0 0 2 Logging surface is minimal and correct: stdlib logger for internal audit trail, no token leakage paths, ASCII-clean wording.
DevX UX Expert 0 1 2 Internal cache mechanics only; no CLI surface/flag/help/error-wording regressions. Typo-detection promise preserved with regression-trap test.
Supply Chain Security Expert 0 2 1 Solid token-hygiene improvement. _scrub_bare_remote_url eliminates the primary on-disk token persistence vector. Residual: FETCH_HEAD in tier-1.
OSS Growth Hacker 0 1 1 Bug-fix PR with strong story potential but missing CHANGELOG entry under [Unreleased].
Auth Expert 0 1 2 Auth paths are sound: _execute_transport_plan composes correctly with both _wt_action and _bare_action; ADO bearer flows through; no AuthResolver bypass.
Doc Writer 0 1 1 PR fixes a user-observable bug but does not add a CHANGELOG entry. One Fixed line is required. No other doc surface affected.
Test Coverage Expert 0 2 1 Race-condition regression trap, error wording, Windows handle-leak guard, and lockfile within-cache parity all covered. ADO bearer + bare path test missing.

B = blocking-severity findings, R = recommended, N = nits.
Counts are signal strength, not gates. The maintainer ships.

Top 5 follow-ups

  1. [Doc Writer + OSS Growth Hacker] Add CHANGELOG entry under ## [Unreleased] > Fixed for [BUG] Parallel sparse-checkout race condition when two subdirectory deps reference the same repository #1126 -- Repo convention requires every merged PR with code changes to have a changelog entry. Two panelists flagged independently. Addressable in-PR before merge -- one line.
  2. [DevX UX Expert] Reword RuntimeError("Failed to materialize working tree from shared bare: ...") to user-facing language -- Internal jargon ("shared bare") leaks to users on failure path. Quick in-PR fix: "Failed to prepare dependency from cached clone: {e}".
  3. [Supply Chain Security Expert] Truncate FETCH_HEAD inside _scrub_bare_remote_url to eliminate residual tokenized URL -- Defense-in-depth: FETCH_HEAD retains tokenized URL after tier-1 init+fetch. Not exploitable today (0700 tmpdir) but cheap to close. Follow-up issue or in-PR.
  4. [Test Coverage Expert] Add unit test for ADO bearer 401 retry through _bare_action composition -- Missing test on a security-relevant path (bearer retry). Existing E2E covers _wt_action but not _bare_action. Post-merge follow-up issue.
  5. [Test Coverage Expert] Harden lockfile cross-path parity assertion (replace soft 'unknown' guard with hard assert) -- Current E2E silently passes if SHA extraction returns 'unknown' -- regression trap has a silent escape hatch. Post-merge hardening.

Architecture

classDiagram
    direction LR
    class SharedCloneCache {
      <<Concurrency Primitive>>
      +get_or_clone(host, owner, repo, ref, clone_fn) Path
      +cleanup() None
      -_entries dict
      -_lock Lock
    }
    class _CacheEntry {
      <<ValueObject>>
      +lock Lock
      +path Path or None
      +error Exception or None
    }
    class GitHubPackageDownloader {
      <<Facade>>
      +download_subdirectory_package(...)
      -_clone_with_fallback(...) Repo
      -_execute_transport_plan(...) None
      -_bare_clone_with_fallback(...) None
      -_materialize_from_bare(...) str
    }
    class TransportSelector {
      <<Strategy>>
      +plan_for(url_form, prefs) TransportPlan
    }
    class TransportPlan {
      <<ValueObject>>
      +attempts list~TransportAttempt~
    }
    class GitCache {
      <<Persistent Cache>>
      +_create_checkout(shard_key, sha, url, env) Path
    }
    SharedCloneCache *-- _CacheEntry : per-key
    GitHubPackageDownloader ..> SharedCloneCache : WS2 path uses
    GitHubPackageDownloader ..> GitCache : WS3 path uses
    GitHubPackageDownloader ..> TransportSelector : resolves plan
    GitHubPackageDownloader ..> TransportPlan : iterates attempts
    note for GitHubPackageDownloader "Template Method: _execute_transport_plan\naccepts clone_action callable (wt or bare)"
    note for SharedCloneCache "Double-checked locking:\nglobal lock for entry creation,\nper-entry lock for clone-once"
    class SharedCloneCache:::touched
    class GitHubPackageDownloader:::touched
    classDef touched fill:#fff3b0,stroke:#d47600
Loading
flowchart TD
    A["download_subdirectory_package<br/>src/apm_cli/deps/github_downloader.py"] --> B{WS2 use_shared?}
    B -->|Yes| C["SharedCloneCache.get_or_clone<br/>per-key entry lock"]
    C --> D{Already cached?}
    D -->|Yes| F["Return bare_path"]
    D -->|No| E["_bare_clone_with_fallback"]
    E --> G["_execute_transport_plan<br/>TransportSelector plan iteration"]
    G --> H{"_bare_action: is_commit_sha?"}
    H -->|Yes| I["Tier-1: git init --bare +<br/>fetch --depth=1 origin SHA"]
    I -->|Fail| J["rmtree partial +<br/>Tier-2: git clone --bare full"]
    H -->|No| K["Tier-1: git clone --bare<br/>--depth=1 --branch ref"]
    K -->|Fail| L["rmtree partial +<br/>Tier-2: git clone --bare full"]
    I -->|OK| M["git update-ref HEAD sha<br/>_scrub_bare_remote_url"]
    J --> M
    K -->|OK| N["_scrub_bare_remote_url"]
    L --> N
    M --> F
    N --> F
    F --> O["_materialize_from_bare"]
    O --> P["git rev-parse HEAD from bare"]
    P --> Q["git clone --local --shared --no-checkout"]
    Q --> R["git config core.autocrlf=false<br/>+ LFS smudge disable"]
    R --> S["git checkout HEAD"]
    S --> T["Return resolved_sha"]
    T --> U["Copy subdir to target_path<br/>robust_copytree"]
    B -->|No| V["Legacy per-dep clone path"]
Loading

Recommendation

Ship after adding the CHANGELOG entry (one line, in-PR) and optionally rewording the RuntimeError message. Zero blocking findings across eight specialists, strong test coverage with passing regression trap at multiple tiers, and improved security posture. The FETCH_HEAD scrub and ADO bearer test gap are legitimate defense-in-depth items for a follow-up issue but do not represent current exploitable risk. The maintainer should merge once the CHANGELOG line lands.


Full per-persona findings

Python Architect

  • [recommended] _bare_action closure is ~100 lines of nested tier logic; extract to a named method for testability at src/apm_cli/deps/github_downloader.py:1020
    The _bare_action closure inside _bare_clone_with_fallback captures git_exe, is_commit_sha, and ref from the enclosing scope and contains two branching tier ladders. At ~100 lines it exceeds the inline-when-truly-unique heuristic. Extracting to a private method like _bare_tier_ladder(url, env, target, git_exe, ref, is_commit_sha) would make the tier logic independently unit-testable.
    Suggested: Extract _bare_action body to a standalone method: def _execute_bare_tiers(self, url, env, target, git_exe, ref, is_commit_sha) -> None.
  • [recommended] Duplicated git subprocess patterns (init+fetch+update-ref+scrub) could use a small builder or helper at src/apm_cli/deps/github_downloader.py:1061
    The update-ref + scrub suffix is repeated 3 times. A small helper _finalize_bare(target, git_exe, env, sha=None) would DRY the call sites. Not blocking because the repetition is currently bounded.
  • [nit] repo_holder: list[Repo] = [] pattern in _clone_with_fallback is an idiomatic Python closure workaround but a nonlocal would be clearer at src/apm_cli/deps/github_downloader.py:648
    The list-as-mutable-cell pattern is functional but non-obvious. Using nonlocal repo_result inside the nested function is more explicit. Trivial change, zero behavioral impact.
  • [nit] _materialize_from_bare duplicates the clone+config+checkout sequence from GitCache._create_checkout without sharing code at src/apm_cli/deps/github_downloader.py:1186
    Both run the same 3-step pattern. A shared utility in core/ would unify the pattern. However, the two live in different dependency layers and the duplication is bounded to ~30 lines, so this is a nit not a recommended -- extract only when a third consumer appears.

CLI Logging Expert

  • [nit] Warning message says 'Tokenized URL may persist on disk until shared cache cleanup' -- consider saying 'until process exit' since cache is instance-scoped at src/apm_cli/deps/github_downloader.py:136
    'shared cache cleanup' is technically accurate but suggests an external mechanism the operator must invoke. 'until install completes' or 'until process exit' matches the actual lifecycle more precisely.
    Suggested: Replace 'until shared cache cleanup' with 'until install run completes' in both warning messages (lines 136 and 144).
  • [nit] Second warning interpolates raw Exception repr (%s on exc) -- consider exc_info=True for richer operator debug at src/apm_cli/deps/github_downloader.py:140
    For a security-audit log, operators benefit from a full traceback, not just the repr. logging.warning(..., exc_info=True) adds the traceback at no cost when log level permits.

DevX UX Expert

  • [recommended] New RuntimeError('Failed to materialize working tree from shared bare: ...') leaks internal jargon to users at src/apm_cli/deps/github_downloader.py:617
    The error message is written for maintainers, not users. A user hitting this sees 'bare', 'materialize', 'working tree' -- vocabulary from git internals, not APM's install mental model. Recovery guidance is also absent.
    Suggested: Reword to: RuntimeError(f"Failed to prepare dependency from cached clone: {e}") or include a recovery hint.
  • [nit] WARNING log about token redaction failure may confuse users running with verbose logging at src/apm_cli/deps/github_downloader.py:74
    The warning sounds like a security alert the user must act on, but per the code this is best-effort and the cache is ephemeral. Consider downgrading to DEBUG or softening the language.
  • [nit] APM_DEBUG-only invariant error uses internal paths in the message at src/apm_cli/deps/shared_clone_cache.py:706
    Printing raw temp-dir paths is an anti-pattern. Since it is gated by APM_DEBUG this is acceptable for developer use, but worth noting for consistency.

Supply Chain Security Expert

  • [recommended] FETCH_HEAD retains tokenized URL after tier-1 init+fetch (not scrubbed) at src/apm_cli/deps/github_downloader.py:1051
    After git fetch --depth=1 origin <sha>, git writes FETCH_HEAD containing the tokenized URL. _scrub_bare_remote_url only redacts remote.origin.url in config; FETCH_HEAD is left untouched. The bare is 0700 and ephemeral, and _materialize_from_bare uses --local --shared so the consumer's clone won't propagate FETCH_HEAD content. Risk is low but for defense-in-depth consider truncating/removing <bare>/FETCH_HEAD inside _scrub_bare_remote_url.
  • [recommended] Best-effort scrub failure leaves token on disk until cache cleanup at src/apm_cli/deps/github_downloader.py:127
    _scrub_bare_remote_url uses check=False and catches all exceptions. Code is correctly logging the failure (audit trail). Consider whether a scrub failure on the primary path should escalate to deleting the bare and re-raising -- this would make the operation fail-closed rather than fail-open with a warning.
  • [nit] Tier-1 window: token in .git/config between remote-add and scrub at src/apm_cli/deps/github_downloader.py:1044
    Between git remote add origin <tokenized_url> and _scrub_bare_remote_url(target, ...), the token lives in the bare's config file on disk. Acceptable given the bare is mode-0700 in a per-session tmpdir, but worth documenting.

OSS Growth Hacker

  • [recommended] Missing CHANGELOG entry for [BUG] Parallel sparse-checkout race condition when two subdirectory deps reference the same repository #1126 fix at CHANGELOG.md:8
    The [Unreleased] section is empty. Per changelog.instructions.md, every merged PR that changes code must have an entry.
    Suggested: Add a Fixed entry: **Parallel subdir install race fix:** two subdirectory deps from the same upstream repo+ref no longer intermittently fail with "Subdirectory ... not found" during parallel apm install. (#1126)
  • [nit] Module docstring is implementation-heavy; consider a one-liner adopters can grep at src/apm_cli/deps/shared_clone_cache.py
    The opening sentence could double as a grep-friendly summary for users debugging install failures.

Auth Expert

  • [recommended] Tier-1 init+fetch stores tokenized URL in .git/config between remote-add and scrub; on CalledProcessError from fetch the scrub never runs at src/apm_cli/deps/github_downloader.py:1069
    The shutil.rmtree IS the correct mitigation but uses ignore_errors=True, so if deletion partially fails (Windows file-locking edge case), the tokenized URL persists. Window is narrow.
    Suggested: Add _scrub_bare_remote_url(target, git_exe, env) before shutil.rmtree in the except clause as defense-in-depth.
  • [nit] _materialize_from_bare receives env that's only used for local-only operations where no auth is needed at src/apm_cli/deps/github_downloader.py:1186
    git clone --local --shared never contacts a remote, so the auth env vars are dead weight here. Consistent with codebase pattern. Fine.
  • [nit] Integration test uses direct os.getenv for GITHUB_APM_PAT -- acceptable for test skip-logic but differs from production AuthResolver flow at tests/integration/test_install_subdir_dedup_e2e.py:811
    Standard pattern for integration tests. No issue.

Doc Writer

  • [recommended] Missing CHANGELOG entry for user-observable bug fix at CHANGELOG.md
    PR fix(deps): subdir-agnostic bare cache fixes parallel sparse-checkout race (#1126) #1135 changes code and tests but does not touch CHANGELOG.md. The fix eliminates an intermittent RuntimeError("Subdirectory '...' not found in repository") during apm install -- a user-visible failure mode.
    Suggested: Under ## [Unreleased], add a ### Fixed block: **Parallel subdir install race.** apm install no longer intermittently fails with RuntimeError: Subdirectory '<path>' not found in repository when multiple dependencies resolve to different subdirectories of the same repo@ref. (#1135, fixes #1126)
  • [nit] No user-facing doc surface affected -- confirmed at docs/src/content/docs/**
    Verified no docs/, README, MANIFESTO, agent/skill/instruction prose, or reference page describes this internal cache layout. No action.

Test Coverage Expert

  • [recommended] ADO bearer 401 retry through _bare_action has no unit test at tests/unit/deps/test_shared_clone_cache.py
    _execute_transport_plan's ADO bearer 401 retry now invokes clone_action with _bare_action. No test in tests/unit/deps/ exercises this specific composition. Existing integration test_ado_bearer_e2e.py only covers the working-tree path.
    Suggested: Add test_ado_bearer_retry_through_bare_action mocking the first attempt to raise CalledProcessError with '401' in stderr.
    Proof (missing): tests/unit/deps/test_shared_clone_cache.py::test_ado_bearer_retry_through_bare_action -- proves: ADO bearer 401 fallback composes correctly with _bare_action's rmtree-before-retry invariant [Secure by default, Multi-harness]
  • [recommended] Lockfile resolved_commit parity: no cross-path comparison (bare-cache vs legacy) at tests/integration/test_install_subdir_dedup_e2e.py
    The E2E asserts sha_a == sha_b within bare-cache path but no test compares against the legacy non-cache path for the same repo+ref. If a future refactor breaks SHA extraction, both sibling deps would get the same wrong SHA and the within-cache parity test would still pass.
    Suggested: Add an integration test variant comparing bare-cache vs legacy path; or strengthen the existing E2E to fail-fast when sha_a == 'unknown'.
    Proof (missing): tests/integration/test_install_subdir_dedup_e2e.py::test_resolved_commit_matches_legacy_path -- proves: Lockfile determinism across paths [Governed by policy, DevX]
  • [nit] E2E lockfile assertion uses soft guard that silently passes on 'unknown' at tests/integration/test_install_subdir_dedup_e2e.py:139
    if sha_a and sha_b and sha_a != 'unknown' means the parity assertion is entirely skipped if SHA extraction is broken. Defeats the stated test purpose.
    Suggested: Replace soft guard with assert sha_a and sha_a != 'unknown', 'SHA extraction failed' before the equality assertion.
    Proof (passed): tests/integration/test_install_subdir_dedup_e2e.py::test_two_subdirs_same_repo_parallel[symbolic-https] -- proves: Lockfile parity within bare-cache path (but soft guard means broken SHA extraction silently passes) [Governed by policy]

This panel is advisory. It does not block merge. Re-apply the
panel-review label after addressing feedback to re-run.

…ror wording)

Two highest-signal in-PR follow-ups from apm-review-panel on PR #1135:

1. Add CHANGELOG.md '### Fixed' entry under [Unreleased] for #1126.
   Flagged independently by doc-writer and oss-growth-hacker;
   required by .github/instructions/changelog.instructions.md.

2. Reword RuntimeError raised by _materialize_from_bare from
   'Failed to materialize working tree from shared bare: {e}' to
   'Failed to prepare dependency from cached clone: {e}' to remove
   internal-jargon ('materialize', 'working tree', 'shared bare')
   from a user-facing error message (devx-ux-expert finding).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reworks WS2 subdirectory dependency dedup to cache subdir-agnostic bare clones instead of mutable working trees, so parallel installs can materialize isolated consumer checkouts without racing each other. This fits the install pipeline by aligning the per-run cache with the existing WS3 bare-checkout pattern.

Changes:

  • Replaced shared working-tree clone reuse with shared bare-clone reuse plus per-consumer materialization in github_downloader.py / shared_clone_cache.py.
  • Added unit and integration regression coverage for the parallel subdirectory race and the new bare-clone fallback/materialization paths.
  • Wired the new integration test into scripts/test-integration.sh and added an unreleased changelog entry.
Show a summary per file
File Description
tests/unit/deps/test_shared_clone_cache.py Adds regression/unit coverage for bare-cache behavior, fallback tiers, and materialization semantics.
tests/integration/test_install_subdir_dedup_e2e.py Adds live E2E coverage for parallel sibling subdirectory installs from one repo/ref.
src/apm_cli/deps/shared_clone_cache.py Changes the per-run shared cache to store bare repos and adds a debug invariant check.
src/apm_cli/deps/github_downloader.py Implements bare-clone fallback, per-consumer materialization, and WS2 resolved-commit threading.
scripts/test-integration.sh Adds the new #1126 integration test to the CI integration script.
CHANGELOG.md Adds an unreleased fix entry for the parallel subdirectory race.

Copilot's findings

Comments suppressed due to low confidence (1)

src/apm_cli/deps/github_downloader.py:1027

  • This retry cleanup switched back to raw shutil.rmtree(..., ignore_errors=True) even though this module's _rmtree() helper exists specifically to absorb transient Windows git/AV locks. If a failed tier leaves a briefly locked file behind, the delete becomes a silent partial cleanup and the next git init --bare/git clone --bare attempt can fail on the leftover directory.
            if target.exists():
                shutil.rmtree(target, ignore_errors=True)
  • Files reviewed: 6/6 changed files
  • Comments generated: 5

Comment thread src/apm_cli/deps/github_downloader.py Outdated
Comment on lines +2085 to +2090
_ws2_resolved_commit = self._materialize_from_bare(
shared_bare_path,
temp_clone_path,
ref=ref,
env=self._git_env_dict(),
known_sha=ref if is_commit_sha else None,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f5e2fc6. The known_sha=ref optimization is now gated on len(ref) == 40 (matches the same gate added to _bare_action's tier 1 in bare_cache.py). Short SHAs flow through tier 2 where git rev-parse --verify <ref>^{commit} resolves them to a canonical 40-char SHA before update-ref HEAD. Regression test added: tests/unit/deps/test_shared_clone_cache.py::TestBareCloneFallback::test_short_sha_skips_tier1_and_resolves_via_tier2.

Comment thread src/apm_cli/deps/github_downloader.py Outdated
# always start from a clean slate. Behavior change from the
# pre-#1126 implementation - covered by 6.13.
if target.exists():
shutil.rmtree(target, ignore_errors=True)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f5e2fc6. Replaced 9 raw shutil.rmtree(..., ignore_errors=True) sites (4 in bare_cache.py, 5 in github_downloader.py -- including line 1027 / _wt_action per the low-confidence finding) with _rmtree(...) which delegates to apm_cli.utils.file_ops.robust_rmtree. That helper retries with backoff and chmod-resets read-only .git/objects/pack/*.pack files (the actual Windows lock case). import shutil removed from both modules. A duplicate _rmtree lives in bare_cache.py to avoid the github_downloader -> bare_cache circular import.

Comment on lines +2055 to +2059
# (host, owner, repo, ref). Each consumer materializes its
# own working tree from the bare; this is subdir-agnostic
# so two parallel consumers requesting different
# subdirectories of the same repo+ref can share one bare
# without racing on sparse-checkout. See design.md sec 5.5.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 1555757. The Parallel Downloads section now describes the WS2 bare + per-consumer materialize pipeline (shared bare clone per (host, owner, repo, ref), then git clone --local --shared --no-checkout into each consumer working tree) instead of the stale sparse-checkout copy. packages/apm-guide/.apm/skills/apm-usage/dependencies.md was checked and contains no sparse/bare/materialize mentions, so no resource update was needed.

Comment on lines +136 to +139
# Lockfile parity: same cache key -> same resolved_commit.
sha_a = getattr(pkg_a, "resolved_commit", None) or getattr(pkg_a, "commit_sha", None)
sha_b = getattr(pkg_b, "resolved_commit", None) or getattr(pkg_b, "commit_sha", None)
if sha_a and sha_b and sha_a != "unknown" and sha_b != "unknown":
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1555757. The integration test now asserts pkg.resolved_reference.resolved_commit directly (the canonical attr from models/dependency/types.py and models/apm_package.py), with hard checks: each must be non-None, not "unknown", exactly 40 chars, and equal across both subdirs. The previous getattr(pkg, "resolved_commit", None) or getattr(pkg, "commit_sha", None) chain silently passed when both attrs were missing; this would have masked any cache-hit propagation regression.

Comment on lines +92 to +99
github_token = os.getenv("GITHUB_APM_PAT") or os.getenv("GITHUB_TOKEN")
if not github_token:
pytest.skip("GitHub token required (GITHUB_APM_PAT or GITHUB_TOKEN)")

if ref_kind == "sha-https":
ref_value = _resolve_known_sha()
if ref_value is None:
pytest.skip("Could not resolve a known SHA on github/awesome-copilot/main")
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1555757. _resolve_known_sha() now reads GITHUB_TOKEN (or falls back to GITHUB_APM_PAT) and passes it to the gh api subprocess via env={**os.environ, "GH_TOKEN": token}. CI workers will have the PAT but no gh auth login state, so this prevents the sha-https variant from silently skipping in CI just because gh is unauthenticated.

@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

[Windows portability paper audit]

Paper audit of the diff for Windows-specific risks. PR-CI does not exercise windows-latest (only push-to-main does), so this is the only Windows-portability check before merge.

# Area Verdict
1 Path handling (backslash/forward, Path vs str, drive letters) OK -- all subprocess args are list-form [git_exe, "-C", str(path), ...], no shell=True, no manual quoting; git-for-Windows accepts both slash forms.
2 git clone --local --shared cross-volume / alternates OK -- bare and consumer are both rooted at get_apm_temp_dir() (same volume), so --local hardlinks via objects/info/alternates; no cross-volume fallback path exercised.
3 GitPython file-handle leaks (Repo.close()) OK -- _materialize_from_bare resolves SHA via git --git-dir <bare> rev-parse HEAD (subprocess), never opens Repo(consumer_dir). WS2 path explicitly bypasses the Repo(temp_clone_path).head.commit.hexsha block via _ws2_resolved_commit (github_downloader.py ~L2198-L2213), eliminating the Windows handle-leak vector that motivated the design.
4 shutil.rmtree read-only .pack handling on Windows BUG -- src/apm_cli/deps/github_downloader.py:658, 1027, 1071, 1112 all call raw shutil.rmtree(target, ignore_errors=True) instead of the repo's own _rmtree(path) / robust_rmtree(...) (which install the _on_readonly_retry onerror handler that chmods + retries read-only files). On Windows, .git/objects/pack/*.pack and .git/index are marked read-only by git; raw shutil.rmtree with ignore_errors=True will silently leave them behind. The subsequent git init --bare / git clone --bare then fails with "destination path already exists and is not an empty directory" on the very retry path the rmtree was added for (Tier-1 -> Tier-2 fallback at L1071, plus all transport-plan / ADO-bearer retries at L658/L1027/L1112). Fix: replace each with _rmtree(target) (already imported, defined at L95-103).
5 CRLF + LFS pinning before checkout OK -- _materialize_from_bare clones with --no-checkout, then sets core.autocrlf=false (check=True) plus filter.lfs.{smudge,clean,process}="" and filter.lfs.required=false (check=False), then git checkout HEAD. Order matches design Sec 5.3 / Sec 12.
6 Process invocation safety OK -- every new subprocess.run uses list args (no shell=True) and the resolved get_git_executable() from shutil.which("git"). Spaces in user-profile paths (e.g. C:\Users\First Last\...) are preserved as a single argv element.
7 Test-fixture write_text -> CRLF translation OK -- new write_text calls in tests/unit/deps/test_shared_clone_cache.py (L37, L55-56, L75, L206, L211, L213, L663, L756) all write fake YAML/marker content that is consumed via Path.exists() checks or mocked validate_apm_package; no sha256 / byte-equality assertion runs against any of them. tests/integration/test_install_subdir_dedup_e2e.py has zero write_text calls. CRLF translation is therefore inert here.
8 Locking semantics (in-process threading.Lock) OK -- SharedCloneCache._get_or_create_entry populates entry.path only inside the per-key lock after clone_fn returns; consumers block on the same per-key lock until populate completes, so no read-during-write race against the bare's objects/pack/. Cross-process isolation is preserved by mkdtemp random suffix per SharedCloneCache instance (design Sec 5.7).
9 Token scrub with paths containing spaces OK -- _scrub_bare_remote_url(bare_path, git_exe, env) invokes [git_exe, "-C", str(bare_path), "remote", "set-url", "origin", "redacted://"]. List-form argv handles spaces; check=False + WARN-on-failure logging is appropriate for a defense-in-depth scrub.
10 MAX_PATH (260) on default Windows RISK (low) -- bare path shape is %TEMP%\apm\apm_shared_<owner>_<repo>_<rand6>\bare\objects\pack\pack-<40>.pack (~190-220 chars typical), plus per-consumer %TEMP%\tmp<rand>\consumer\<subdir-tree>\.... Comfortable for github.com short owner/repo names; tight for ADO repos with long org/project/repo path components or deeply-nested monorepo subdirs (<org>/<project>/_git/<repo> can add 60-100 chars before the cache prefix). Pre-existing characteristic, but the new architecture adds one extra dir level (<bare> + <consumer>/<subdir>) per dep vs the old single-tree layout. No fix proposed; suggest tracking as a follow-up to switch the Tier 1 cache path to a content-addressed short hash (e.g. 8-char sha) once the broader cache layout is revisited.

Summary verdict: PASS WITH FOLLOWUPS. One must-fix BUG (item 4: raw shutil.rmtree on the retry paths -- breaks any Windows retry that follows a partial pack download). One low RISK on long-path margins (item 10) that is worth tracking but not blocking.

Recommended fix for item 4 (one-line change at each of 4 sites):

# Replace at L658, L1027, L1071, L1112:
- shutil.rmtree(target, ignore_errors=True)
+ _rmtree(target)

_rmtree is already defined in this file (L95-103) and delegates to robust_rmtree with the chmod-retry onerror callback. No new imports required.

Tracking issue recommended (not opened by this audit -- maintainer call) so item 10 doesn't get lost. Item 4 should land before merge given Windows CI runs only post-merge on main/tags.

(Audit performed against tip d5f51012; bare clone path = src/apm_cli/deps/shared_clone_cache.py + src/apm_cli/deps/github_downloader.py.)

Daniel Meppiel and others added 3 commits May 4, 2026 19:19
…ear file-length guardrail

Splits github_downloader.py (was 2751 lines, exceeded 2400-line CI guardrail)
by moving bare-cache helpers and the clone-failure error-message builder
into a new src/apm_cli/deps/bare_cache.py module:

  - _scrub_bare_remote_url
  - bare_clone_with_fallback
  - materialize_from_bare
  - clone_with_fallback (parameterized via repo_cls for test patchability)
  - build_clone_failure_message

GitHubPackageDownloader retains thin delegating instance methods
(_clone_with_fallback, _bare_clone_with_fallback, _materialize_from_bare)
so existing patch.object call sites continue to work unchanged.

Existing tests that patch 'apm_cli.deps.github_downloader.Repo' continue
to function because the thin wrappers resolve Repo from the
github_downloader module namespace at call time and pass it as repo_cls.

github_downloader.py: 2751 -> 2383 lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n hardening (#1135)

* Short-SHA correctness (Copilot review): bare-cache tier 1 (init+fetch
  by SHA) requires a full 40-char SHA — git's smart-HTTP cannot resolve
  abbreviations and update-ref refuses partial refs. Skip tier 1 for
  short SHAs and resolve them via tier-2 'rev-parse --verify <ref>^{commit}',
  feeding the canonical 40-char SHA into update-ref HEAD. Also gate the
  consumer-side 'known_sha=ref' optimization on len==40.

* Windows portability (Copilot review): replace 9 raw shutil.rmtree
  call sites in bare_cache.py and github_downloader.py with the local
  _rmtree helper that delegates to robust_rmtree (chmod-retries
  read-only .git/objects/pack/* files). Adds a duplicate _rmtree to
  bare_cache.py to avoid the github_downloader -> bare_cache circular
  import.

* FETCH_HEAD supply-chain scrub (panel follow-up): tier-1 init+fetch
  leaves the tokenized URL inside .git/FETCH_HEAD even after the
  remote.origin.url config redaction. Truncate FETCH_HEAD to empty in
  _scrub_bare_remote_url so the token does not survive on disk.

* ADO bearer retry on bare 401 (panel follow-up): subprocess
  CalledProcessError str() does NOT include stderr, so the existing
  401/Authentication-failed marker check in _execute_transport_plan
  never matched the bare-clone path. Append e.stderr to err_msg so
  bearer retry composes with bare clones identically to the
  GitPython-based working-tree path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…update bare+materialize pipeline (#1135)

* Bare-cache short-SHA test: assert tier 1 (init+fetch) is skipped for
  abbreviated SHAs and the resolved 40-char SHA from rev-parse drives
  update-ref HEAD, not the user-supplied abbreviation.

* Bare-cache tier-2 test: stub rev-parse --verify stdout with the full
  40-char SHA so the test exercises the new 'use resolved SHA, not
  input ref' code path. Existing tier-1 fixture now uses a real
  40-char hex SHA (the prior fixture was 43 chars, which is now
  correctly rejected by the len()==40 gate).

* FETCH_HEAD scrub tests: assert _scrub_bare_remote_url truncates
  FETCH_HEAD when present (defense-in-depth supply-chain) and is a
  no-op when absent (tier-2 full-clone path).

* ADO bearer 401 retry on bare clone: end-to-end test that PAT failure
  with 'Authentication failed' in stderr triggers bearer retry through
  the bare clone action and the stale-PAT diagnostic fires on success.

* Integration test (subdir dedup E2E):
  - lockfile parity: assert pkg.resolved_reference.resolved_commit
    is a 40-char SHA and identical across both subdirs (was a soft
    getattr fallback that silently passed if attr was missing).
  - sha-https variant: pass GH_TOKEN env to the gh api subprocess so
    it works under CI workers without ambient gh auth login state.

* Docs: update Parallel Downloads section to describe the WS2 bare +
  per-consumer materialize pipeline (replaces stale sparse-checkout
  copy).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Folded in Copilot review + panel follow-ups

Commits: f5e2fc6 (src), 1555757 (tests + docs)

Copilot review findings (5/5 addressed)

  • Short-SHA correctness -- gated tier 1 on len(ref) == 40; tier 2 resolves abbreviations via rev-parse --verify <ref>^{commit} and feeds the canonical 40-char SHA into update-ref HEAD.
  • Windows portability -- 9 shutil.rmtree(..., ignore_errors=True) sites swapped for _rmtree(...) (delegates to robust_rmtree, chmod-resets read-only .git/objects/pack/*.pack files). import shutil removed from both modules.
  • Docs -- docs/.../guides/dependencies.md Parallel Downloads section now describes the bare + per-consumer materialize pipeline (replaces stale sparse-checkout copy).
  • Test SHA attr -- integration test asserts pkg.resolved_reference.resolved_commit directly with hard 40-char + equality checks (was a soft getattr chain).
  • gh api token -- _resolve_known_sha() passes GH_TOKEN to subprocess so the sha-https variant works under CI without ambient gh auth login.

Panel follow-ups

  • FETCH_HEAD scrub (supply-chain) -- _scrub_bare_remote_url now truncates .git/FETCH_HEAD so the tokenized URL does not survive on disk after tier-1 init+fetch. Defense-in-depth on top of the existing remote.origin.url redaction.
  • ADO bare 401 retry -- _execute_transport_plan now appends e.stderr to err_msg before the auth-marker check. str(subprocess.CalledProcessError) does NOT include stderr, so the existing retry logic never fired on the bare-clone path. New unit test asserts PAT 401 -> bearer retry composes correctly with _bare_action.

Verification

  • uv run --extra dev ruff check src/ tests/ silent
  • uv run --extra dev ruff format --check src/ tests/ silent
  • 7577 unit tests pass, 0 failures
  • File-length guardrail: github_downloader.py at 2399 lines (under 2400)

Each Copilot review thread received an inline reply pointing at the fix commit. Watching CI for green confirmation.

@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Local verification matrix

Re-ran the full suite locally on macOS / Python 3.12.9 against real github/awesome-copilot on commit 1555757b to give the maintainer something concrete beyond CI (which on this repo runs Lint + Linux unit tests on PRs but does not exercise integration on Windows or against real remotes).

Integration tests against real GitHub remote (github/awesome-copilot)

Two sibling subdirs (skills/acquire-codebase-knowledge + skills/agent-governance) installed in parallel, sharing the same (host, owner, repo, ref) cache key. Pre-fix this was the deterministic reproducer for the race; post-fix it must succeed for all three ref-resolution paths.

Test Ref kind What it proves Result
test_two_subdirs_same_repo_parallel[symbolic-https-main] ref="main" The original race fix end-to-end: two consumers of the same repo@main resolve sibling subdirs without colliding PASS
test_two_subdirs_same_repo_parallel[default-branch-None] no ref Default-branch path through _bare_clone_with_fallback (tier 1 shallow clone) PASS
test_two_subdirs_same_repo_parallel[sha-https-RESOLVE_AT_RUNTIME] 40-char commit SHA resolved at fixture-setup via gh api 3-tier SHA fallback in bare_cache.py (init+fetch → init+fetch+rev-parse → full clone+rev-parse) and the bare-derived resolved-commit lockfile parity PASS

3 passed in 17.79s.

Unit tests for the new module surface (tests/unit/deps/test_shared_clone_cache.py)

Class Coverage area Tests Result
TestSharedCloneCache Cache fundamentals: single-clone, dedup, divergent refs, failure propagation, lock serialization, lifecycle 6 PASS
TestDownloaderSharedCloneIntegration Two subdir deps share a single bare clone via the downloader 1 PASS
TestBareCacheRaceCondition Barrier-deterministic parallel-different-subdirs reproducer (the bug) 1 PASS
TestBareCloneFallback All 3 tiers: SHA tier 1 init+fetch, SHA tier 2 fetch-rejection, short-SHA tier 2, symbolic shallow tier 1 4 PASS
TestMaterializeFromBare git clone --local --shared --no-checkout semantics: real bare materialization, SHA from bare not consumer (lifetime invariant), known-SHA shortcut, LFS smudge disabled, autocrlf=false pinned before checkout 5 PASS
TestSharedCloneCacheBareInvariant APM_DEBUG accepts bare / rejects non-bare clone shapes 2 PASS
TestExecuteTransportPlanWtAction Working-tree action handles missing target 1 PASS
TestBareCloneRetryRmtree _bare_action rmtrees target before init (Windows pack-file portability) 1 PASS
TestInvalidSubdirErrorWording Typo subdir surfaces Subdirectory '...' not found (UX contract) 1 PASS
TestBareScrubFetchHead Defense-in-depth: _scrub_bare_remote_url truncates FETCH_HEAD when present and is no-op when absent 2 PASS
TestAdoBareBearerRetry ADO PAT 401 → bearer retry on the bare-clone path (this test caught a real prod bug: str(CalledProcessError) omits stderr, so the retry never fired before this commit) 1 PASS

25 passed in 4.06s.

Regression sweep (no behavior changes outside the WS2 path)

Ran the broader install + dependency real-remote integration suites to confirm the bare+materialize pipeline does not regress unrelated install flows (lockfile, dry-run, verbose-redaction, link-mode, local-bundle, single/multi-dep install, branch refs, error handling, network failure, CLI commands, dep update workflow, cross-platform download).

Suite Tests Result
tests/integration/test_apm_dependencies.py 15 13 pass, 1 skip (binary_compatibility_with_dependencies requires built binary), 1 unrelated env failure (test_authentication_token_handling asserts token starts with github_pat_; local gh auth token returns a gho_ OAuth token — pure env quirk, exists on main too)
tests/integration/test_install_local_bundle_e2e.py 19 PASS
tests/integration/test_install_dry_run_e2e.py 5 PASS
tests/integration/test_install_verbose_redaction_e2e.py 4 PASS
tests/integration/test_install_with_links.py 2 PASS

30/30 install-flow regression tests pass; 13/13 real-remote dependency-install behavior tests pass.

Race-coverage stratification (the "simulate properly" gate)

Two layers, intentional:

Layer Mechanism Why both Status
Unit (deterministic) threading.Barrier(2) in TestBareCacheRaceCondition forces lock contention in SharedCloneCache.get_or_clone regardless of timing Fast, hermetic, runs on every PR via CI PASS
Integration (real BFS) ThreadPoolExecutor(max_workers=4) BFS hits github/awesome-copilot with two sibling subdirs sharing one cache key Catches integration drift that mocked unit tests can't see; verifies the production code path against a real git remote PASS

Honest gaps (not run, declared)

  • Windows execution: build-release.yml (the only workflow with windows-latest) does not trigger on PRs in this repo, only on push-to-main and tags. Windows coverage on this branch is via paper audit + the _rmtree/CRLF/LFS pinning fixes; first real Windows run will happen post-merge on main.
  • ADO E2E: no live ADO remote in the integration suite; ADO bearer-retry path covered by unit test test_bare_clone_recovers_via_ado_bearer_after_pat_401 only.

Worktree: /Users/danielmeppiel/Repos/apm-1126-fix on 1555757b.

@danielmeppiel danielmeppiel merged commit 2753f72 into main May 4, 2026
32 checks passed
@danielmeppiel danielmeppiel deleted the fix/1126-bare-cache-materialization branch May 4, 2026 18:45
danielmeppiel added a commit that referenced this pull request May 5, 2026
Roll forward the four PRs merged since v0.12.1:

- #1137 feat(audit): default-on integration drift detection
- #1135 fix(deps): subdir-agnostic bare cache (parallel sparse-checkout race)
        also resolves duplicate report #1140 (ADO sub-path manifestation)
- #1129 docs: first-package guide accuracy (includes: auto, skill paths)
- #1127 docs: APM's role for skills, plugins-as-packages, ADO sub-paths

Bump pyproject.toml + uv.lock 0.12.1 -> 0.12.2 and convert the
Unreleased CHANGELOG block into the 0.12.2 section, with a single
'so what' line per merged PR per the changelog contract.

Co-authored-by: Daniel Meppiel <copilot-rework@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Parallel sparse-checkout race condition when two subdirectory deps reference the same repository

2 participants